optimize threading of mha #20088
Conversation
Force-pushed from d32a693 to 75787c5
Thanks @yufenglee for the PR! Do you have before and after perf numbers for the repro in issue #19924?
On my local box, it takes ~1.5 ms before and ~0.5 ms after.
We may need to take a look at the cost model approach to see why the cost model does not work properly, since it is fundamental for the CPU EP. Maybe try using the correct cost (e.g., adding the concat KV cost) to see whether that resolves the issue as well.
Please run a benchmark for a before/after comparison.
Would changing `loop_len` to `batch_size * num_heads_ * sequence_length` solve the issue?
Force-pushed from 75787c5 to dc3d01b
The cost computation of `ComputeVxAttentionScore` is wrong. It should be `sequence_length * v_head_size * total_sequence_length` instead of `sequence_length * v_head_size * sequence_length`. Also fine-tuned the cost computation for data load and store.
Force-pushed from 68f7c92 to d4a32d3
Description
The cost computation of `ComputeVxAttentionScore` is wrong. It should be `sequence_length * v_head_size * total_sequence_length` instead of `sequence_length * v_head_size * sequence_length`.
The PR also fine-tuned the cost computation.
On my local box with an i9 CPU, the performance is the same as the unfused version, but it is much faster on an Azure VM with 16 threads.
Motivation and Context
#19924